# Initial setup
# Env Variables
CadasPath = "cadas"
CadasPackagePath = file.path(CadasPath, "R")
PersonalPackagePath = "~/scripts/packages/"
source(file.path(CadasPath, "tmp", "sourcedFiles.R"))
load(file.path(CadasPath, "config", "config.v2chemistry.fastalignment.2016-02-13.RData"))
load(file.path(CadasPath, "tmp", 'sampledata.RData'))
# Packages
library(magrittr)
library(knitr)
library(MASS)
library(scales)
library(parallel)
library(gdata)
library(dplyr)
library(devtools)
# Mike's personal utility package
load_all(file.path(PersonalPackagePath, "MehanGlobals"))
# source functions for this project, using source from my profile.
# Replace with base R source if compiling outside my environment
source.file("scripts/main.R", env.name = 'global_shadow')
A new methodology for detecting fetal aneuploidy was published recently that does not require a reference control population (Sun, 2017). This approach used a paired Mann-Whitney test to detect the increase in shorter reads relative to long reads that is associated fetal trisomy. The Mann-Whitney test is performed between all fragments less that 150 bp and fragments greater than 170 bp, although the authors stated that the approach is fairly robust to changes in these cutoffs. The authors claim that the major advantage of this approach is that is does not require a reference set of controls to define the expected characteristics of euploid pregnancies. The purpose of this report is to compare the performance of this approach to the log-likelihood approach used by Brigid.
The dataset used for this analysis was composed of ten flowcells previously used for a methods comparison study (See BRIGID-0233) between Brigid and VerifiPlus (Table 1). The output of the Brigid pipeline was provided in ten excel spreadsheets. Each Excel spreadsheet had three tabs, which were labeled: Summary, Data, and Misindexed. Each of the Data tabs were saved as a csv file and were combined to create a single data set. Only samples with an empty QCFailure field were included in the final data set.
# Load data from method comparison study
files = list.files("datasets", pattern="*AllSummary.csv", full.names = T)
cadas.data = lapply(files, read.csv) %>%
Reduce(rbind, .) %>%
dplyr::filter(QCFailure == "") %>%
as.data.frame
table(cadas.data$Flowcell) %>%
as.data.frame %>%
{names(.) = c("Flowcell", "Samples"); .} %>%
kable(caption="Table 1. Flowcells in the comparison study and number of samples that passed QC.")
| Flowcell | Samples |
|---|---|
| H7V3GBGXY | 46 |
| H7VW2BGXY | 46 |
| HC5TJBGXY | 47 |
| HFYGNBGXY | 47 |
| HG2L3BGXY | 47 |
| HG3GVBGXY | 46 |
| HG3MKBGXY | 43 |
| HG3N3BGXY | 47 |
| HGCGCBGXY | 47 |
| HGGG2BGXY | 46 |
The implementation of the COFFEE algorithm takes the same “short_reads” and “long_reads” input data used by the Brigid pipeline, which is a matrix of bin counts for each sample. Short fragments are all fragments less that 150 bp and long fragments are all fragments greater than 170 bp. The raw bin count data was generated on the cluster using an adaptation of the fragment counting and fragment aggregation scripts from the Brigid pipeline. The data for the short and long reads from the individual flowcells were combined into a single dataset with 30,970 bins and 480 samples.
# Commented this out to save RMarkdown compile time. The coffee.input variable is used
# to create coffee.data, which is loaded from disk below, making this step unneccesary.
#coffee.input = create.coffee.input()
coffee.input = readRDS("datasets/coffee.input.rds")
The bin counts were normalized to total fragment count and corrected for GC bias using the cADAS computeBinCov function. There was a slight discrepancy between the 100kb bins defined in the bed file used by the pipeline and the 100kb bins in the config file. The bins in the bed file referenced by the pipeline had a 24 bp offset at the beginning of each bin, whereas the the bed file within the config file started at 1 (no offset). Furthermore, although both bed files had the same number of bins, the pipeline bed file had a single bin for mitochondrial DNA and the config bed file had an extra bin for the end of chromosome Y. To resolve these differences the 24 base pair offset was removed and the mitochondrial DNA bin was replaced with the extra chromosome Y bin. All the values of the extra chromosome Y bin were set to NA. After the normalized and corrected bin coverage was computed for long and short reads, the one-sided paired Mann-Whitney p-values were calculated between them.
# Loading this from disk to save compile time. Originally created with this call:
#coffee.data = coffee(config, coffee.input)
coffee.data = readRDS('datasets/coffee.data.rds')
The clinical outcomes of the samples from the methods comparisons study are unknown so a comparison of the performance characteristics of the two approaches could not be performed. The relative agreement of the two methods was examined by comparing the log-likelihood scores from Brigid to the log COFFEE p-values (Figure 1). The correlation between the samples that were above the thresholds for both methods (LLR > 1.5, COFFEE > 3) was 0.85, suggesting that the two methods are similar for samples that are likely be positive for fetal trisomy 21. The right panel of Figure 1 shows a zoomed in view of the decision boundaries. There were five samples that were called positive by LLR but not COFFEE. There were four samples called by COFFEE and not by LLR.
merged.data = merge.cadas.coffee(cadas.data, coffee.data)
comp.plot(merged.data, '21')
Figure 1. Comparison of the cADAS log likelihood and COFFEE -log(p-value) for chromosome 21. The right panel shows a zoomed in version of the the same data on the left panel
tab = table(merged.data$ScoreT_21 > 1.5, merged.data$COFFEE_21 > coffee.thresh) %>%
{
class(.) = 'matrix'
rownames(.) = sprintf("**LLR_%s**", gsub("ALSE|RUE", "", rownames(.)))
colnames(.) = sprintf("COFFEE_%s", gsub("ALSE|RUE", "", colnames(.)))
.
}
The confusion matrix summarizing these differences is shown in Table 2. The positive agreement between the two approaches was 78/87 (89.7%).
kable(tab, caption="Table 2. Confusion matrix between LLR and COFFEE. The suffix _T indicates a positive call and _F indicates a negative call")
| COFFEE_F | COFFEE_T | |
|---|---|---|
| LLR_F | 375 | 4 |
| LLR_T | 5 | 78 |
Non-excluded sites (NES) and fetal fraction (FF) both affect the ability to detect fetal aneuploidy. In simple terms, FF is proportional to the signal and NES is inversely proportional to the noise. Therefore, samples that have low values for these measures are more likely to be discordant between the two methods. Figure 2 shows the NES vs. FF for all the positive calls by either approach, which reveals that the nine discordant calls cluster towards the bottom left of the plot. Two of the samples missed by COFFEE have FF less than 5% and below the 12th percentile for NES. The remaining three samples missed by COFFEE all have FF less than 7.5%, suggesting that if these Brigid calls were correct, that this implementation of COFFEE is less sensitive than LLR. A more detailed analysis of each of these discordant samples is presented in the appendix.
discord = nes.ff.plot(merged.data, '21')
The data from the Brigid pipeline did not have any gender calls, so a comparison of NCV_Y and the COFFEE p-values was performed. These two measures are in strong agreement.
scatter.plot(merged.data$COFFEE_Y, merged.data$NCV_Y, include.ab=F, include.cor=F,
ylab="NCV_Y", xlab="COFFEE (-log(p-value))")
Figure 3. Comaprison of NCD_Y and COFFEE p-values from COFFEE algorithm.
The COFFEE publication claims that GC correction is not required for their approach, presumably because they are making paired comparisons within a bin. The analysis in this report used GC corrected data for the COFFEE algorithm. A comparison of the two approaches when GC correction is not performed prior to calculating the MW statistics is shown in Figure 4. The agreement with Brigid LLR scores decreases when GC correction is not performed, which is not surprising given that GC correction is part of the Brigid pipeline. There are no longer any samples that are called positive by COFFEE that are called negative by Brigid. The number of samples called positive by Brigid but not called by COFFEE increased to 10 (Table 3).
coffee.data.noGC = readRDS("datasets/coffee.data.noGC.rds")
merged.data.noGC = merge.cadas.coffee(cadas.data, coffee.data.noGC)
comp.plot(merged.data.noGC, '21')
Figure 2. Comparison of cADAS LLR and COFFEE p-value when no GC correction is performed prior to COFFEE M-W test.
table(merged.data.noGC$ScoreT_21 > 1.5, merged.data.noGC$COFFEE_21 > coffee.thresh) %>%
{
class(.) = 'matrix'
rownames(.) = sprintf("**LLR_%s**", gsub("ALSE|RUE", "", rownames(.)))
colnames(.) = sprintf("COFFEE_%s", gsub("ALSE|RUE", "", colnames(.)))
.
} %>%
kable(caption="Table 3. Confusion matrix between LR and COFFEE with no GC Correction The suffix _T indicates a positive call and _F indicates a negative call")
| COFFEE_F | COFFEE_T | |
|---|---|---|
| LLR_F | 379 | 0 |
| LLR_T | 10 | 73 |
Fetal cfDNA is smaller on average than maternal cfDNA. Therefore, in the presence of a fetal trisomy, the increase in observed reads counts will be more pronounced in the reads less than 150 base pairs. The Brigid pipeline and the COFFEE algorithm leverage this fact in different ways. The Brigid pipeline processes the short fragments (80-150 bp) and long fragments (150-300) separately. It calculates a t-test between the normalized bin counts on the target chromosome (e.g. 13, 18, 21) and other chromosomes from the same individual that are assumed to be euploid. The t-statistics from the short and long reads are used to calculate a log-likelihood using parameters defined by a large training set. Normalizing the bin counts across all chromosomes is crucial for this method because the t-statistic is computed by comparing values between bins (target chr vs. reference chr).
The COFFEE algorithm employs a different approach. It doesn’t attempt to detect an overall increase in read counts derived from a single chromosome. Instead, it only focuses on detecting an over-representation of small reads within a chromosome, which is derived from an increase in the relative proportion of fetal cfDNA. Therefore, all comparisons are made within a bin. This greatly reduces the need for accurate normalization because much of the bin-related bias affects both the small and large read counts similarly.
Although the COFFEE algorithm is less sophisticated, the scores are very similar the scores provided by the Brigid pipeline. One limitation of this study is that the clinical outcomes of these patients are not known. Therefore it is difficult to draw any conclusions about which method is performing better. Given that both methods differ slightly in their approach to handling the shift in the read size distribution, it is possible than an ensemble approach that incorporates both techniques may have merit. If the performance of both approaches can be evaluated using clinical outcomes, then the relative performance of each approach under sub-optimal conditions, such as low NES or FF, can be evaluated. If one method out performs the other in these scenarios then perhaps a reflex or combination of the two approaches can be considered.
What makes this approach unique is that it only considers the difference between short and long fragments within a single chromosome from a single individual, without the need for a large reference training set or reference chromosomes from the same individual. The COFFEE algorithm uses a paired Mann-Whitney test, which is a non-parametric test that evaluates the differences between two values. Other approaches exist for looking at paired differences, such as the t-test (parametric), which are likely covered by a similar set of claims. It would be interesting to note how broad the claims are regarding looking at paired differences within a bin. For example, do they cover machine learning techniques using bin size distributions as features.
Sun, K., Chan, K. C. A., Hudecova, I., Chiu, R. W. K., Lo, Y. M. D. and Jiang, P. (2017) ‘COFFEE: control-free noninvasive fetal chromosomal examination using maternal plasma DNA’, Prenatal Diagnosis, pp. 336–340. doi: 10.1002/pd.5016.
There were 4 samples called positive by COFFEE but not by Brigid. These samples are listed in the following table and bin plots for each sample are shown below.
coffee.only = discord[ discord$Discord == 'COFFEE Only' ,]
kable(coffee.only[, c("FF", "NES", "NCV_21", "ScoreT_21", "COFFEE_21")],
digits=2)
| FF | NES | NCV_21 | ScoreT_21 | COFFEE_21 | |
|---|---|---|---|---|---|
| H7VW2BGXY_762192 | 0.11 | 8.0M | 1.10 | -7.81 | 3.01 |
| HC5TJBGXY_685618 | 0.18 | 8.5M | 4.88 | -10.26 | 3.64 |
| HG2L3BGXY_146124863 | 0.09 | 7.7M | 0.54 | -4.36 | 4.02 |
| HG3N3BGXY_146125876 | 0.08 | 6.4M | 0.80 | -2.70 | 6.03 |
# Need just normalized counts for these plots
ylim.short = c(0, .0001)
ylim.long= c(00, .0001)
xlim = c(24500, 26600)
ylim.short = NULL
ylim.long= NULL
par(mfrow=c(1,2))
sample = row.names(coffee.only)[1]
genome.bin.plot(coffee.data$bincov$short_reads, sample, config,
mtype='short_reads', ylim=ylim.short, xlim=xlim);
genome.bin.plot(coffee.data$bincov$long_reads, sample, config,
mtype='long_reads', ylim=ylim.long, xlim=xlim);
sp()
par(mfrow=c(1,2))
sample = row.names(coffee.only)[2]
genome.bin.plot(coffee.data$bincov$short_reads, sample, config,
mtype='short_reads', ylim=ylim.short, xlim=xlim);
genome.bin.plot(coffee.data$bincov$long_reads, sample, config,
mtype='long_reads', ylim=ylim.long, xlim=xlim);
sp()
par(mfrow=c(1,2))
sample = row.names(coffee.only)[3]
genome.bin.plot(coffee.data$bincov$short_reads, sample, config,
mtype='short_reads', ylim=ylim.short, xlim=xlim);
genome.bin.plot(coffee.data$bincov$long_reads, sample, config,
mtype='long_reads', ylim=ylim.long, xlim=xlim);
sp()
par(mfrow=c(1,2))
sample = row.names(coffee.only)[4]
genome.bin.plot(coffee.data$bincov$short_reads, sample, config,
mtype='short_reads', ylim=ylim.short, xlim=xlim);
genome.bin.plot(coffee.data$bincov$long_reads, sample, config,
mtype='long_reads', ylim=ylim.long, xlim=xlim);
There were 5 samples called positive by COFFEE but not by Brigid. These samples are listed in the following table and bin plots for each sample are listed below.
llr.only = discord[ discord$Discord == 'LLR Only' ,]
kable(llr.only[, c("FF", "NES", "NCV_21", "ScoreT_21", "COFFEE_21")],
digits=2)
| FF | NES | NCV_21 | ScoreT_21 | COFFEE_21 | |
|---|---|---|---|---|---|
| H7VW2BGXY_570471 | 0.07 | 13.3M | 9.04 | 82.92 | 2.15 |
| HFYGNBGXY_181953786 | 0.04 | 8.6M | 4.81 | 15.29 | 2.31 |
| HG3MKBGXY_193892255 | 0.04 | 5.0M | 3.81 | 8.51 | 2.22 |
| HGCGCBGXY_501364 | 0.03 | 5.3M | 6.81 | 18.02 | 2.75 |
| HGGG2BGXY_186213032 | 0.06 | 8.3M | 5.19 | 9.66 | 0.53 |
par(mfrow=c(1,2))
sample = row.names(llr.only)[1]
genome.bin.plot(coffee.data$bincov$short_reads, sample, config,
mtype='short_reads', ylim=ylim.short, xlim=xlim);
genome.bin.plot(coffee.data$bincov$long_reads, sample, config,
mtype='long_reads', ylim=ylim.long, xlim=xlim);
par(mfrow=c(1,2))
sample = row.names(llr.only)[2]
genome.bin.plot(coffee.data$bincov$short_reads, sample, config,
mtype='short_reads', ylim=ylim.short, xlim=xlim);
genome.bin.plot(coffee.data$bincov$long_reads, sample, config,
mtype='long_reads', ylim=ylim.long, xlim=xlim);
par(mfrow=c(1,2))
sample = row.names(llr.only)[3]
genome.bin.plot(coffee.data$bincov$short_reads, sample, config,
mtype='short_reads', ylim=ylim.short, xlim=xlim);
genome.bin.plot(coffee.data$bincov$long_reads, sample, config,
mtype='long_reads', ylim=ylim.long, xlim=xlim);
par(mfrow=c(1,2))
sample = row.names(llr.only)[4]
genome.bin.plot(coffee.data$bincov$short_reads, sample, config,
mtype='short_reads', ylim=ylim.short, xlim=xlim);
genome.bin.plot(coffee.data$bincov$long_reads, sample, config,
mtype='long_reads', ylim=ylim.long, xlim=xlim);
par(mfrow=c(1,2))
sample = row.names(llr.only)[5]
genome.bin.plot(coffee.data$bincov$short_reads, sample, config,
mtype='short_reads', ylim=ylim.short, xlim=xlim);
genome.bin.plot(coffee.data$bincov$long_reads, sample, config,
mtype='long_reads', ylim=ylim.long, xlim=xlim);